Microbiome Bioinformatics Snakemake workflow
…BRIEF INTRO IN PROGRESS…
A tentative snakemake workflow that defines mothur and qiime2 bioinformatics rules in a DAG (directed acyclic graph) format. A detailed interactive snakemake report is available here.
Getting started with Mothur pipeline
Create a mothur YAML file
name: mothur48
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- mothur =1.48.0
- vsearch =2.22.1
Create mothur env using the YAML file
conda activate base
conda env create -n mothur48 --file mothur48.yml
conda activate mothur48
Download references databases
- Download Silva alignment reference database.
- Creatge Silva classifier from Silva alignments.
- Download RDP classifier.
bash workflow/scripts/mothurReferences.sh
Overview of Mothur classification methods
There are four methods that can be used to profile microbial communities present in a sample. Here we briefly decribe each method:
1.Classify OTUs
- OTUs (Operational Taxonomic Units (OTUs)) are clusters of similar sequences and are commonly accepted as analytical units in microbial profiling when using 16S rRNA gene markers.
2. Classify Phylotypes
- A phylotype in microbiome research is a DNA sequence or group of sequences sharing more than an arbitrarily chosen level of similarity of a 16S rRNA gene marker.
3. Classify ASVs
- ASVs Amplicon Sequence Variants (ASVs)in microbiome research is any inferred single DNA sequences recovered from a bioinformatics analysis of 16S rRNA marker genes.
- ASV is typically really a cluster of sequences that are one or two bases apart from each other.
4. Classify Phylogenies
- Microbial phylogenies are from gene sequence homologies. Models of mutation determine the most-likely evolutionary histories.
Preliminary OTU analysis using Mothur
The preliminary analysis (alpha_beta_diversity rule) is
part of the bioinformatics analysis. It includes:
- Creating reads count for each group.
- Subsampling for downstream analysis.
- Rarefaction.
- Computing Alpha diversity metrics.
- Computing Beta diversity metrics.
- Getting sample distances.
- Constructing sample phylip tree.
- Generating ordination matrices including PCoA and NMDS.
Getting started with QIIME 2 pipeline
Get QIIME2 YAML file
wget https://data.qiime2.org/distro/core/qiime2-2023.2-py38-osx-conda.yml
Create qiime2 env and install qiime2
Current YAML file: qiime2-2023.2-py38-osx-conda.yml available
The qiime2 YAML file contains over 500 dependencies. Listed below is just a few QIIME 2 framework dependencies to get the installation started.
name: qiime2202320
channels:
- qiime2/label/r2023.2
- conda-forge
- bioconda
- defaults
dependencies:
- q2cli=2023.2.0
- qiime2=2023.2.0
- python=3.8.16
- q2-alignment=2023.2.0
- q2-composition=2023.2.0
- q2-cutadapt=2023.2.0
- q2-dada2=2023.2.0
- q2-deblur=2023.2.0
- q2-demux=2023.2.0
- q2-diversity=2023.2.0
- q2-diversity-lib=2023.2.0
- q2-emperor=2023.2.0
- q2-feature-classifier=2023.2.0
- q2-feature-table=2023.2.0
- q2-fragment-insertion=2023.2.0
- q2-gneiss=2023.2.0
- q2-longitudinal=2023.2.0
- q2-metadata=2023.2.0
- q2-mystery-stew=2023.2.0
- q2-phylogeny=2023.2.0
- q2-quality-control=2023.2.0
- q2-quality-filter=2023.2.0
- q2-sample-classifier=2023.2.0
- q2-taxa=2023.2.0
- q2-types=2023.2.0
- q2-vsearch=2023.2.0
Installing QIIME2 using a bash script
conda activate base
wget https://data.qiime2.org/distro/core/qiime2-2023.2-py38-osx-conda.yml
conda env create -n qiime2-2023.2 --file qiime2-2023.2-py38-osx-conda.yml
conda activate qiime2-2023.2
qiime info
Downloading demo data
Demo data from one of QIIME 2[1] tutorials.
mkdir -p resources
mkdir -p resources/reads
mkdir -p resources/references
cd mkdir -p resources/reads
wget \
-O "sample-metadata.tsv" \
"https://data.qiime2.org/2023.2/tutorials/atacama-soils/sample_metadata.tsv"
wget \
-O "emp-paired-end-sequences/forward.fastq.gz" \
"https://data.qiime2.org/2023.2/tutorials/atacama-soils/10p/forward.fastq.gz"
wget \
-O "emp-paired-end-sequences/reverse.fastq.gz" \
"https://data.qiime2.org/2023.2/tutorials/atacama-soils/10p/reverse.fastq.gz"
wget \
-O "emp-paired-end-sequences/barcodes.fastq.gz" \
"https://data.qiime2.org/2023.2/tutorials/atacama-soils/10p/barcodes.fastq.gz"
Download a QIIME 2 trained classifer
wget \
-O "gg-13-8-99-515-806-nb-classifier.qza" \
"https://data.qiime2.org/2023.2/common/gg-13-8-99-515-806-nb-classifier.qza"
Other classifiers also exist. Check on QIIME2 website for more information.
Overview of QIIME 2 classification methods
1. De novo clustering
Sequences are clustered against one another.
Closed-reference clustering
Here the clustering is performed at 99% identity against the Greengenes reference database.
Open-reference clustering
Here the clustering is performed at 99% identity against the Greengenes reference database.
Alignment of representative sequences
- The MAFFT (Multiple Alignment using Fast Fourier Transform) software provides alignments of the representative sequences.
- Then we will run alignment mask function to remove poor alignments.
Identifying and filtering chimeric feature sequences
References
Appendix
Project main tree
.
├── LICENSE.md
├── README.md
├── REFER
│  ├── dags
│  ├── data
│  ├── envs
│  ├── images
│  ├── imap-bioinformatics-qiime2
│  ├── imap-bioinformatics.Rproj
│  ├── index.html
│  ├── render.R
│  ├── report.html
│  ├── resources
│  ├── results
│  ├── rules_dag.sh
│  ├── smk_html_report.sh
│  └── workflow
├── config
│  ├── config.yml
│  ├── pbs-torque
│  ├── samples.tsv
│  ├── slurm
│  └── units.tsv
├── dags
│  ├── rulegraph.png
│  └── rulegraph.svg
├── data
│  ├── README.md
│  ├── logs
│  ├── metadata
│  ├── mothur
│  ├── qiime
│  ├── reads
│  ├── references
│  └── test
├── images
│  ├── bioinformatics.png
│  ├── bkgd.png
│  ├── imap_part02.svg
│  ├── imap_part03.svg
│  ├── imap_part04.svg
│  ├── imap_part05.svg
│  ├── silvaalign.png
│  ├── smkreport
│  └── sra_config_cache.png
├── index.Rmd
├── library
│  ├── apa.csl
│  ├── export.bib
│  ├── imap.bib
│  └── references.bib
├── mothur.1679667746.logfile
├── mothur.1679667810.logfile
├── mothur_process
│  ├── control.final.shared
│  ├── current_files.summary
│  ├── error_analysis
│  ├── final.count_table
│  ├── final.list
│  ├── final.rep.fasta
│  ├── final.sensspec
│  ├── final.steps
│  ├── final.tax.summary
│  ├── intermediate
│  ├── logs
│  ├── mock.final.shared
│  ├── sample.final.0.03.subsample.shared
│  ├── sample.final.braycurtis.0.03.lt.ave.dist
│  ├── sample.final.braycurtis.0.03.lt.dist
│  ├── sample.final.braycurtis.0.03.lt.nmds.axes
│  ├── sample.final.braycurtis.0.03.lt.nmds.iters
│  ├── sample.final.braycurtis.0.03.lt.nmds.stress
│  ├── sample.final.braycurtis.0.03.lt.pcoa.axes
│  ├── sample.final.braycurtis.0.03.lt.pcoa.loadings
│  ├── sample.final.braycurtis.0.03.lt.std.dist
│  ├── sample.final.braycurtis.0.03.lt.tre
│  ├── sample.final.count.summary
│  ├── sample.final.groups.ave-std.summary
│  ├── sample.final.groups.rarefaction
│  ├── sample.final.groups.summary
│  ├── sample.final.shared
│  ├── sample.final.sharedsobs.0.03.lt.ave.dist
│  ├── sample.final.sharedsobs.0.03.lt.dist
│  ├── sample.final.sharedsobs.0.03.lt.std.dist
│  ├── sample.final.thetayc.0.03.lt.ave.dist
│  ├── sample.final.thetayc.0.03.lt.dist
│  ├── sample.final.thetayc.0.03.lt.std.dist
│  ├── test.contigs_report
│  ├── test.files
│  ├── test.scrap.contigs.fasta
│  ├── test.scrap.contigs.qual
│  ├── test.trim.contigs.fasta
│  └── test.trim.contigs.qual
├── qiime2_process
│  ├── demux.qza
│  ├── demux.qzv
│  ├── feature-table-dn-99.qza
│  ├── feature-table.qza
│  ├── feature-table.qzv
│  ├── new-ref-seqs-or-85.qza
│  ├── q2-sample-metadata.qzv
│  ├── rep-seqs-cr-85.qza
│  ├── rep-seqs-dn-99.qza
│  ├── rep-seqs-dn-99.qzv
│  ├── rep-seqs-or-85.qza
│  ├── rep-seqs.qza
│  ├── rep-seqs.qzv
│  ├── stats.qza
│  ├── stats.qzv
│  ├── table-cr-85.qza
│  ├── table-or-85.qza
│  └── unmatched-cr-85.qza
├── report.html
├── resources
│  ├── 85_otus.qza
│  ├── final_fasta
│  ├── gg-13-8-99-515-806-nb-classifier.qza
│  ├── metadata
│  ├── reads
│  └── test
├── results
│  └── project_tree.txt
├── smk.css
├── styles.css
├── tree.sh
└── workflow
├── Snakefile
├── envs
├── report
├── rules
└── scripts
41 directories, 91 files
Static snakemake report
The interactive snakemake HTML report can be viewed by opening the
report.htmlusing any compatible browser. You will be able to explore the workflow and the associated statistics. You can close the left bar to get a more expansive display view.
Summary of Informative workflows
Troubleshooting of FAQs
- Question
- Question
-
Answer
-
Answer